{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 入门云原生AI - 3. 提交MPI任务\n",
    "\n",
    "在这个示例中，我们将演示：\n",
    "\n",
    "* 利用Arena提交分布式MPI的训练任务,并且查看训练任务状态和日志\n",
    "* 通过TensorBoard查看训练任务\n",
    "\n",
    "> 前提：请先完成文档中的[共享存储配置](../docs/setup/SETUP_NAS.md)，当前${HOME}就是其中`training-data`的数据卷对应目录。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.下载TensorFlow样例源代码到${HOME}/models目录"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cloning into '/root/models/tensorflow-benchmarks'...\n",
      "remote: Enumerating objects: 3748, done.\u001b[K\n",
      "remote: Counting objects: 100% (3748/3748), done.\u001b[K\n",
      "remote: Compressing objects: 100% (1170/1170), done.\u001b[K\n",
      "remote: Total 3748 (delta 2557), reused 3748 (delta 2557)\u001b[K\n",
      "Receiving objects: 100% (3748/3748), 1.98 MiB | 0 bytes/s, done.\n",
      "Resolving deltas: 100% (2557/2557), done.\n",
      "Checking connectivity... done.\n"
     ]
    }
   ],
   "source": [
    "! if [ ! -d \"${HOME}/models/tensorflow-benchmarks\" ] ; then \\\n",
    "  git clone -b cnn_tf_v1.9_compatible \"https://code.aliyun.com/xiaozhou/benchmark.git\" \"${HOME}/models/tensorflow-benchmarks\"; \\\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3.创建训练结果${HOME}/output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "! mkdir -p ${HOME}/output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4.查看目录结构, 其中`dataset`是数据目录，`models`是模型代码目录，`output`是训练结果目录。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/root\r\n",
      "|-- dataset\r\n",
      "|-- models\r\n",
      "|   `-- tensorflow-benchmarks\r\n",
      "|       |-- LICENSE\r\n",
      "|       |-- README.md\r\n",
      "|       |-- bower_components\r\n",
      "|       |-- dashboard_app\r\n",
      "|       |-- index.html\r\n",
      "|       |-- js\r\n",
      "|       |-- scripts\r\n",
      "|       |-- soumith_benchmarks.html\r\n",
      "|       `-- tools\r\n",
      "`-- output\r\n",
      "\r\n",
      "9 directories, 4 files\r\n"
     ]
    }
   ],
   "source": [
    "! tree -I ai-starter -L 3 ${HOME}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5.检查可用GPU资源"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)\r\n",
      "cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0\r\n",
      "cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           0\r\n",
      "cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0\r\n",
      "cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0\r\n",
      "cn-zhangjiakou.i-8vbezxqzueo7662i0dbq  192.168.0.204  master  0           0\r\n",
      "cn-zhangjiakou.i-8vbezxqzueo7681j4fav  192.168.0.206  master  0           0\r\n",
      "-----------------------------------------------------------------------------------------\r\n",
      "Allocated/Total GPUs In Cluster:\r\n",
      "0/3 (0%)  \r\n"
     ]
    }
   ],
   "source": [
    "! arena top node"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6.通过Arena提交训练任务, 这里`training-data`在配置[共享存储时](../docs/setup/SETUP_NAS.md)创建.   \n",
    "`--data=training-data:/training`将其映射到训练任务的`/training`目录。  \n",
    "`/training`目录下的子目录`/training/models/tensorflow-benchmarks`就是步骤1拷贝源代码的位置。  \n",
    "`/training`目录下的子目录`/training/output/mpi-benchmarks`就是步骤3创建的训练结果输出的位置。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: JOB_NAME=tf-mpi-benchmarks\n",
      "configmap/tf-mpi-benchmarks-mpijob created\n",
      "configmap/tf-mpi-benchmarks-mpijob labeled\n",
      "service/tf-mpi-benchmarks-tensorboard created\n",
      "deployment.extensions/tf-mpi-benchmarks-tensorboard created\n",
      "mpijob.kubeflow.org/tf-mpi-benchmarks created\n",
      "\u001b[36mINFO\u001b[0m[0004] The Job tf-mpi-benchmarks has been submitted successfully \n",
      "\u001b[36mINFO\u001b[0m[0004] You can run `arena get tf-mpi-benchmarks --type mpijob` to check the job status \n"
     ]
    }
   ],
   "source": [
    "# Set the Job Name\n",
    "%env JOB_NAME=tf-mpi-benchmarks\n",
    "%env USER_DATA_NAME=training-data\n",
    "# Submit a training job \n",
    "# using code and data from Data Volume\n",
    "!arena submit mpi \\\n",
    "             --name=$JOB_NAME \\\n",
    "             --workers=2 \\\n",
    "             --gpus=1 \\\n",
    "             --data=$USER_DATA_NAME:/training \\\n",
    "             --tensorboard \\\n",
    "             --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \\\n",
    "             --logdir=/training/output/mpi-benchmarks \\\n",
    "             \"mpirun python /training/models/tensorflow-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64     --variable_update horovod --train_dir=/training/output/mpi-benchmarks --summary_verbosity=3 --save_summaries_steps=10\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> `Arena`命令的`--logdir`指定`tensorboard`从训练任务的指定目录读取event\n",
    "> 完整参数可以参考[命令行文档](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7.检查模型训练状态，当任务状态从`Pending`转为`Running`后就可以查看日志和GPU使用率了。这里`-e`为了方便检查任务`Pending`的原因。通常看到`[Pulling] pulling image \"uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5\"`代表容器镜像过大，导致任务处于`Pending`。这时可以重复执行下列命令直到任务状态变为`Running`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "STATUS: RUNNING\r\n",
      "NAMESPACE: default\r\n",
      "TRAINING DURATION: 57s\r\n",
      "\r\n",
      "NAME               STATUS   TRAINER  AGE  INSTANCE                          NODE\r\n",
      "tf-mpi-benchmarks  RUNNING  MPIJOB   57s  tf-mpi-benchmarks-launcher-ph6fr  192.168.0.208\r\n",
      "tf-mpi-benchmarks  RUNNING  MPIJOB   57s  tf-mpi-benchmarks-worker-0        192.168.0.210\r\n",
      "tf-mpi-benchmarks  RUNNING  MPIJOB   57s  tf-mpi-benchmarks-worker-1        192.168.0.209\r\n",
      "\r\n",
      "Your tensorboard will be available on:\r\n",
      "192.168.0.206:31129   \r\n",
      "\r\n",
      "Events: \r\n",
      "No events for pending pod\r\n"
     ]
    }
   ],
   "source": [
    "! arena get $JOB_NAME -e"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](3-1-tensorboard.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "8.实时检查日志，此时可以通过调整`--tail=`的数值展示输出的行数。默认为显示全部日志。\n",
    "如果想要实时查看日志，可以增加`-f`参数。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2019-03-02T10:25:06.633100624Z 2019-03-02 10:25:06.632415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N \r\n",
      "2019-03-02T10:25:06.633242941Z 2019-03-02 10:25:06.632923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15111 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 6.0)\r\n",
      "2019-03-02T10:25:06.946553283Z I0302 10:25:06.947077 140068638594816 tf_logging.py:115] Starting standard services.\r\n",
      "2019-03-02T10:25:06.94658461Z W0302 10:25:06.947337 140068638594816 tf_logging.py:125] Standard services need a 'logdir' passed to the SessionManager\r\n",
      "2019-03-02T10:25:06.946588591Z I0302 10:25:06.947422 140068638594816 tf_logging.py:115] Starting queue runners.\r\n",
      "2019-03-02T10:25:08.02147656Z I0302 10:25:08.020608 140189223290624 tf_logging.py:115] Running local_init_op.\r\n",
      "2019-03-02T10:25:08.456852287Z I0302 10:25:08.455843 140189223290624 tf_logging.py:115] Done running local_init_op.\r\n",
      "2019-03-02T10:25:13.438192987Z I0302 10:25:13.437267 140189223290624 tf_logging.py:115] Starting standard services.\r\n",
      "2019-03-02T10:25:13.540538367Z I0302 10:25:13.539715 140189223290624 tf_logging.py:115] Starting queue runners.\r\n",
      "2019-03-02T10:25:14.381412593Z I0302 10:25:14.380504 140185215694592 tf_logging.py:159] global_step/sec: 0\r\n",
      "2019-03-02T10:25:15.258478623Z Running warm up\r\n",
      "2019-03-02T10:25:15.283527742Z Running warm up\r\n",
      "2019-03-02T10:25:20.910169525Z \r\n",
      "2019-03-02T10:25:20.910219717Z tf-mpi-benchmarks-worker-0:8562:8575 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]\r\n",
      "2019-03-02T10:25:20.910241917Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Using internal Network Socket\r\n",
      "2019-03-02T10:25:20.910244787Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384\r\n",
      "2019-03-02T10:25:20.910247406Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO NET : Using interface eth0:172.16.2.154<0>\r\n",
      "2019-03-02T10:25:20.910250216Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO NET/Socket : 1 interfaces found\r\n",
      "2019-03-02T10:25:20.910252844Z NCCL version 2.2.13+cuda9.0\r\n",
      "2019-03-02T10:25:20.911021019Z \r\n",
      "2019-03-02T10:25:20.911034061Z tf-mpi-benchmarks-worker-1:22278:22294 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]\r\n",
      "2019-03-02T10:25:20.911039817Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO Using internal Network Socket\r\n",
      "2019-03-02T10:25:20.911044469Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384\r\n",
      "2019-03-02T10:25:20.915446261Z Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/2DJDBRMOBPLS3OLU7TWWCCX54S:/var/lib/docker/overlay2/l/PFG7JK77ZESMCADSD7OQUEPNNA:/var/lib/docker/overlay2/l/253K6TLSZMLAFF5XYM4QK33IDQ:/var/lib/docker/overlay2/l/7APFLVLMJRQQ5ONOK4TXUO3OXX:/var/lib/docker/overlay2/l/46WR5AGPYCSW6OVYV6SW3YYJXG:/var/lib/docker/overlay2/l/7A4MC2QCGNADQYXXZPTC6KRQIE:/var/lib/docker/overlay2/l/KKLS6NXJ6AKMWODEOHSSPPUXXU:/var/lib/docker/overlay2/l/WROAZLKZPVWBL4D3SZUU4VUUMV:/var/lib/docker/overlay2/l/JVZBJ4BH6U456'\r\n",
      "2019-03-02T10:25:20.915466117Z Unexpected end of /proc/mounts line `R2MOJSMX5EJU5:/var/lib/docker/overlay2/l/MNZI2ZHI7SFFLRKCTI7YME7JA4:/var/lib/docker/overlay2/l/L47TL5KBZKM5ZLI4CHK3565W5B:/var/lib/docker/overlay2/l/MSIKN7UHJEFYRJVSKU3IYRTBJ2:/var/lib/docker/overlay2/l/EL5YF3AEVYU34BCUN5QQZCE4IU:/var/lib/docker/overlay2/l/PPSCQMEKYC3RSLTDPVID2FHVGC:/var/lib/docker/overlay2/l/WPBTAL3YAOWRRQJIY343MCI4PZ:/var/lib/docker/overlay2/l/ZUMMHLCWCI42XIRULUDUYX5UBP:/var/lib/docker/overlay2/l/5U55TSOZJM3OH5TSNLFUXYW4MG:/var/lib/docker/overlay2/l/HTTBG5Z5MV5FWYZC42JWNXQOUT:/var/lib/do'\r\n",
      "2019-03-02T10:25:20.915473649Z Unexpected end of /proc/mounts line `cker/overlay2/l/P2637BN543AV7WYBCZPGF56JET:/var/lib/docker/overlay2/l/KYQO63RWGFEYCBVN6E5HAXNZXM:/var/lib/docker/overlay2/l/NM5K355TS52O4IXHNSULI3YCL7:/var/lib/docker/overlay2/l/IAKHDKK26V4N2EKQNU4RLB64ZW:/var/lib/docker/overlay2/l/SUD2DOE35DUKLVST7QEXX6NIT3:/var/lib/docker/overlay2/l/FJMN2F4MCKYAMICYG7R7YPBOIY:/var/lib/docker/overlay2/l/YBYUCPHIBIUU4BE73LPC6J45HT,upperdir=/var/lib/docker/overlay2/cba173854c5c75b89a4f2ff0d9bac1c04b4f8ec9e03f0d50c5be64dbd1be187f/diff,workdir=/var/lib/docker/overlay2/cba1738'\r\n",
      "2019-03-02T10:25:20.916475876Z Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/FRZVWQYK7OWI3PBMGI5ZYDP6A2:/var/lib/docker/overlay2/l/DCWCZJJWUNVKEVCIWJVX6GSF3U:/var/lib/docker/overlay2/l/PGPIXXECDFLQVN532W2J7D3BIV:/var/lib/docker/overlay2/l/RTVTVZFIASVYRNA2T3E3Z473PE:/var/lib/docker/overlay2/l/3HGT4MVWKBKIWUZKKNU5URGZTY:/var/lib/docker/overlay2/l/M2LIXNGFQSGPEOO4FIKKWJ6MVF:/var/lib/docker/overlay2/l/S4A32QN43C3CKQ3SXW53UR4X5C:/var/lib/docker/overlay2/l/SPX7HOQ3EHDP3GLDRREKA4N65A:/var/lib/docker/overlay2/l/VP7H2BL4QPVEW'\r\n",
      "2019-03-02T10:25:20.916489821Z Unexpected end of /proc/mounts line `Z2JWGQMMECYHE:/var/lib/docker/overlay2/l/WA54EUV7CXDYN2LRD4YB5S3PHA:/var/lib/docker/overlay2/l/MLOKMOIIDA6B2VOWEQYWF45VEH:/var/lib/docker/overlay2/l/TLEMXYSZPS5P4XZ2FSNT3VBNHB:/var/lib/docker/overlay2/l/BR6VU5X3QA2YBWWQLQVKBVDSH3:/var/lib/docker/overlay2/l/J7D2QICKY5MN2UND7K7AGGAKTX:/var/lib/docker/overlay2/l/PEADBYPXXLIRID42BRBQCKVPCJ:/var/lib/docker/overlay2/l/P35PZJL5VWQZHAD4HFKL4BSKT2:/var/lib/docker/overlay2/l/RGDWJ7ARAHYJG7FGVFUHDA6RQC:/var/lib/docker/overlay2/l/4KUXB7XIV4QFXCQJLBYZCCNHID:/var/lib/do'\r\n",
      "2019-03-02T10:25:20.91650441Z Unexpected end of /proc/mounts line `cker/overlay2/l/GV6UAYEUJM2OWUSUICLUBZJDTR:/var/lib/docker/overlay2/l/6YQ632LUIZX4M2HWIDJIYEBNJ4:/var/lib/docker/overlay2/l/FJM6EITOVS53B6EQJFBNZRRNTL:/var/lib/docker/overlay2/l/NRV5CBUOJZ7UFQCOEIXT4M7MK3:/var/lib/docker/overlay2/l/WCF7HFJVI5SVZGAVNYCMOVFIYZ:/var/lib/docker/overlay2/l/VEVETHF2VF64VQUWQWQ4KV6WQZ:/var/lib/docker/overlay2/l/I5IGFWWPRMZ5FW72DA43T3Z226,upperdir=/var/lib/docker/overlay2/4d0c86aec61501de716eefcf7b0ea09f3828888664f1c9cdc667681f56cc33cf/diff,workdir=/var/lib/docker/overlay2/4d0c86a'\r\n",
      "2019-03-02T10:25:20.924272335Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO comm 0x7f7fd02f9100 rank 0 nranks 2\r\n",
      "2019-03-02T10:25:20.931008568Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO comm 0x7f63bc353b60 rank 1 nranks 2\r\n",
      "2019-03-02T10:25:20.931022959Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO NET : Using interface eth0:172.16.1.232<0>\r\n",
      "2019-03-02T10:25:20.931026195Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO NET/Socket : 1 interfaces found\r\n",
      "2019-03-02T10:25:20.932348798Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Using 256 threads\r\n",
      "2019-03-02T10:25:20.932531317Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Min Comp Cap 6\r\n",
      "2019-03-02T10:25:20.932538838Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072\r\n",
      "2019-03-02T10:25:20.936292475Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Ring 00 :    0   1\r\n",
      "2019-03-02T10:25:20.94294648Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO 1 -> 0 via NET/Socket/0\r\n",
      "2019-03-02T10:25:20.94334095Z tf-mpi-benchmarks-worker-1:22278:22294 [0] INFO 0 -> 1 via NET/Socket/0\r\n",
      "2019-03-02T10:25:20.951039327Z tf-mpi-benchmarks-worker-0:8562:8575 [0] INFO Launch mode Parallel\r\n",
      "2019-03-02T10:25:31.778698369Z Done warm up\r\n",
      "2019-03-02T10:25:31.779338161Z Step\tImg/sec\ttotal_loss\r\n",
      "2019-03-02T10:25:34.463195993Z Done warm up\r\n",
      "2019-03-02T10:25:34.468385192Z Step\tImg/sec\ttotal_loss\r\n",
      "2019-03-02T10:25:35.03856711Z 1\timages/sec: 112.4 +/- 0.0 (jitter = 0.0)\t9.102\r\n",
      "2019-03-02T10:25:35.038612058Z 1\timages/sec: 19.6 +/- 0.0 (jitter = 0.0)\t8.976\r\n",
      "2019-03-02T10:25:40.20952965Z 10\timages/sec: 75.9 +/- 8.7 (jitter = 2.0)\t9.101\r\n",
      "2019-03-02T10:25:42.972409109Z 10\timages/sec: 75.3 +/- 8.8 (jitter = 1.8)\t9.106\r\n",
      "2019-03-02T10:25:48.874253241Z 20\timages/sec: 74.9 +/- 6.2 (jitter = 2.2)\t8.978\r\n",
      "2019-03-02T10:25:51.483494563Z 20\timages/sec: 75.3 +/- 6.2 (jitter = 2.3)\t9.182\r\n"
     ]
    }
   ],
   "source": [
    "! arena logs --tail=50 $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "9.查看实时训练的GPU使用情况"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INSTANCE NAME                     GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)           STATUS   NODE\r\n",
      "tf-mpi-benchmarks-launcher-s6x5c  N/A                N/A              N/A                       Running  192.168.0.208\r\n",
      "tf-mpi-benchmarks-worker-0        0                  97%              15651.0MiB / 16276.2MiB   Running  192.168.0.210\r\n",
      "tf-mpi-benchmarks-worker-1        0                  0%               15651.0MiB / 16276.2MiB   Running  192.168.0.208\r\n"
     ]
    }
   ],
   "source": [
    "! arena top job $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "10.通过TensorBoard查看训练趋势。\n",
    "您可以通过Tensorboard查看训练趋势， 通过执行 `arena get ${JOB_NAME}`， 您可以获取到tensorboard的集群内网IP，本例中是 `192.168.0.206:31129`。如果您通过笔记本电脑无法直接访问 Tensorboard，可以考虑在您的笔记本电脑使用 `sshuttle`。例如：`sshuttle -r root@41.82.59.51 192.168.0.0/16`。其中`41.82.59.51`为集群内某个节点的外网IP，且该外网IP可以通过ssh访问。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "STATUS: RUNNING\r\n",
      "NAMESPACE: default\r\n",
      "TRAINING DURATION: 1m\r\n",
      "\r\n",
      "NAME               STATUS   TRAINER  AGE  INSTANCE                          NODE\r\n",
      "tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-launcher-ph6fr  192.168.0.208\r\n",
      "tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-worker-0        192.168.0.210\r\n",
      "tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-worker-1        192.168.0.209\r\n",
      "\r\n",
      "Your tensorboard will be available on:\r\n",
      "192.168.0.206:31129   \r\n"
     ]
    }
   ],
   "source": [
    "! arena get $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "11.查看模型训练产生的结果, 在`output`下生成了训练结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/root\r\n",
      "|-- dataset\r\n",
      "|-- models\r\n",
      "|   `-- tensorflow-benchmarks\r\n",
      "|       |-- LICENSE\r\n",
      "|       |-- README.md\r\n",
      "|       |-- bower_components\r\n",
      "|       |-- dashboard_app\r\n",
      "|       |-- index.html\r\n",
      "|       |-- js\r\n",
      "|       |-- scripts\r\n",
      "|       |-- soumith_benchmarks.html\r\n",
      "|       `-- tools\r\n",
      "`-- output\r\n",
      "    `-- mpi-benchmarks\r\n",
      "        |-- checkpoint\r\n",
      "        |-- events.out.tfevents.1551522695.tf-mpi-benchmarks-worker-0\r\n",
      "        |-- graph.pbtxt\r\n",
      "        |-- model.ckpt-110.data-00000-of-00001\r\n",
      "        |-- model.ckpt-110.index\r\n",
      "        `-- model.ckpt-110.meta\r\n",
      "\r\n",
      "10 directories, 10 files\r\n"
     ]
    }
   ],
   "source": [
    "! tree -I ai-starter -L 3 ${HOME}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "12.删除已经完成的任务"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "service \"tf-distributed-mnist-tensorboard\" deleted\n",
      "deployment.extensions \"tf-distributed-mnist-tensorboard\" deleted\n",
      "tfjob.kubeflow.org \"tf-distributed-mnist\" deleted\n",
      "configmap \"tf-distributed-mnist-tfjob\" deleted\n",
      "\u001b[36mINFO\u001b[0m[0004] The Job tf-distributed-mnist has been deleted successfully \n"
     ]
    }
   ],
   "source": [
    "! arena delete $JOB_NAME"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "恭喜！您已经使用 `arena` 成功运行了训练作业，而且还能轻松检查 Tensorboard。\n",
    "\n",
    "总结，希望您通过本次演示了解：\n",
    "1. 如何准备代码，并将其放入数据卷中\n",
    "2. 如何在分布式MPI训练任务中引用数据卷，并且使用其中的代码和数据\n",
    "3. 如何利用arena管理您的分布式训练任务。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
